Introduction :


Every time you type a query into a search engine, how does Google display precise and
relevant results in milliseconds, even though it owns eight data centres¹ around the
world? When you type notes, how does a spell checker find errors and suggest suitable
corrections for all the misspelled words, even though there are a total of 171,476
English words²? What happens behind the screen? Many daily processes like these use
fundamental algorithms, called string matching algorithms, to find one pattern in
enormous datasets. The idea that such major tasks rely on simple, basic units called
strings inspired me to research the efficiency of string matching algorithms. String
matching algorithms are fundamental algorithms used for finding a particular pattern in
a dataset.

As we grow up, time becomes a crucial factor in our hectic lifestyles, and as
technology develops, the amount of data stored keeps expanding. It is therefore
essential to determine the most suitable string searching algorithm, the one with the
lowest average runtime for processing the data. The purpose of this investigation is to
examine how the entered pattern may influence the time complexity of two algorithms.
For this investigation, I will take two popular string matching algorithms, the Rabin
Karp algorithm and the Boyer Moore algorithm, and find how the entered pattern affects
the time complexity of both algorithms.


1 "Google Data Center FAQ & Locations | Data Center Knowledge." 17 Mar. 2017,
https://www.datacenterknowledge.com/archives/2017/03/16/google-data-center-faq. Accessed 9 May. 2019.

2 "How many words are there in the English ... - Lexico.com."
https://www.lexico.com/en/explore/how-many-words-are-there-in-the-english-language. Accessed 9 May. 2019.


Background information : 


String matching algorithms and their applications: 


Words play a significant role in our day to day lives. Most of us do not realise the
immense importance of these tiny fundamental blocks of language. Humans have been
using words as a communication tool for so long that no one knows when words were
invented. Merely imagining a world without words sounds disastrous. We use words
everywhere, and in technical terms words are a form of string. A string is a sequence
of characters. Characters are letters, numbers, punctuation marks, spaces or symbols³.
A string matching algorithm, also called a string searching algorithm, finds the
occurrences of a pattern in a text and returns the position of the pattern as output.

Ever since humankind started storing data in string format, finding a string within
large data sets has been a significant problem, and this led to the discovery of
various string matching algorithms. The naive string search algorithm is the most basic
string search algorithm. We perform this algorithm in our day to day life without
knowing its name. If we want to find a word, we take the pattern we need to find and
slide it along the text, comparing each character of the pattern with the text to check
whether the characters match. If the whole pattern matches, we stop; if it does not
match, we continue through the text to find where the pattern occurs. This algorithm is
simple yet inefficient, and this issue inspired inventors like J Strother Moore, Robert
Stephen Boyer, Michael O. Rabin, and Richard M. Karp to discover various string
matching algorithms.

String searching algorithms are becoming an essential part of our lives as we use them
in day to day applications. String matching algorithms are used in spell checkers,
search engines, plagiarism detection programs, the growing field of bioinformatics (for
finding DNA sequences), digital forensics, information retrieval systems for text
mining, and spam filters. String matching algorithms are divided into four types
according to the way they approach the given data: classical algorithms, bit
parallelism algorithms, suffix automata algorithms, and hashing algorithms. For this
investigation, two popular algorithms with different approaches were chosen. From the
classical algorithms, the Boyer Moore algorithm was chosen, as it is said to be one of
the oldest benchmark string searching algorithms and many variations have since been
developed using it as a base. From the hashing algorithms, the Rabin Karp algorithm was
chosen, as it uses a hash function to do its processing.

3 "Character Definition - TechTerms." https://techterms.com/definition/character. Accessed 29 May. 2019.
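The naive search described above can be sketched in Java as follows. This is only a minimal illustration and not code that was used in the investigation; the class name NaiveSearch and the method name firstIndexOf are hypothetical names chosen for this sketch.

```java
// Minimal sketch of the naive string search (hypothetical names, not the investigation's code).
class NaiveSearch {
    // Returns the index of the first occurrence of pat in txt, or -1 if it does not occur.
    static int firstIndexOf(String txt, String pat) {
        int n = txt.length();
        int m = pat.length();
        // Try every possible alignment of the pattern against the text.
        for (int s = 0; s <= n - m; s++) {
            int j = 0;
            // Compare character by character until a mismatch or the end of the pattern.
            while (j < m && txt.charAt(s + j) == pat.charAt(j)) {
                j++;
            }
            if (j == m) {
                return s; // every character matched at shift s
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIndexOf("the world wide web", "wide")); // prints 10
    }
}
```

Because every alignment may need up to m character comparisons, this approach can take on the order of n*m comparisons, which is the inefficiency the other algorithms try to avoid.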


Rabin Karp algorithm :


The Rabin Karp algorithm was discovered by Richard M. Karp and Michael O. Rabin in the
year 1987⁴. Since it uses a hashing approach, a hash function is used to calculate hash
values from the characters of a string. A substring is a segment of the text, of the
same length as the pattern, which is taken for comparison. The hash value of the
entered pattern is compared with the hash value of the substring. If the hash values
match, the individual characters of the pattern and of the substring are compared one
by one (equivalently, their individual character values are compared). If the hash
values do not match, or the individual characters do not match, the algorithm slides
over the text and chooses the next substring to compare. After a new substring is
chosen from the text, the whole process is repeated. When every character of the
pattern matches the corresponding character of the substring, the pattern is considered
to be found and the index value is returned. The use of hashing is believed to speed up
the search, because most substrings can be rejected by comparing a single hash value.
The better the hash function, the fewer false hash matches occur, so less time is spent
on character comparisons that fail.

4 "Rabin-Karp Pattern Searching Algorithm - OpenGenus IQ." https://iq.opengenus.org/rabin-karp-string-pattern-searching-algorithm/. Accessed 29 May. 2019.


This is the hash function which will be used in this investigation: 


hash(txt[s+1 .. s+m]) = (d * (hash(txt[s .. s+m-1]) - txt[s] * h) + txt[s+m]) % q

Here hash(txt[s+1 .. s+m]) is the hash for the next substring, hash(txt[s .. s+m-1]) is
the hash for the current substring, txt[s] is the leading character and txt[s+m] is the
trailing character.

Figure 2: the hash function formula

d: represents the total number of characters present in the ASCII code.
q: represents a prime number
h: represents d^(m-1)
m: represents the pattern length
n: represents the text length

This formula also performs rehashing, as the hash value of the next substring is
generated from the hash value of the current substring and the next character in the
text.
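The rehashing step can be checked with a short Java sketch. It assumes d = 256 and q = 101, the values that appear in the appendix code; the class name RollingHash and its method names are hypothetical and chosen only for this illustration. The sketch computes the hash of each substring twice, once from scratch and once with the formula above, so the two values can be compared.

```java
// Sketch of the rolling hash formula (hypothetical names; d and q follow the appendix code).
class RollingHash {
    static final int D = 256; // number of characters in the alphabet
    static final int Q = 101; // a prime number

    // Hash of txt[s .. s+m-1] computed from scratch.
    static int hash(String txt, int s, int m) {
        int value = 0;
        for (int i = 0; i < m; i++) {
            value = (D * value + txt.charAt(s + i)) % Q;
        }
        return value;
    }

    // h = d^(m-1) mod q, the weight of the leading character.
    static int highPower(int m) {
        int h = 1;
        for (int i = 0; i < m - 1; i++) {
            h = (h * D) % Q;
        }
        return h;
    }

    // Hash of txt[s+1 .. s+m] derived from the hash of txt[s .. s+m-1].
    static int roll(String txt, int s, int m, int current, int h) {
        int next = (D * (current - txt.charAt(s) * h) + txt.charAt(s + m)) % Q;
        if (next < 0) {
            next += Q; // Java's % can return a negative value
        }
        return next;
    }

    public static void main(String[] args) {
        String txt = "ABCBC";
        int m = 2;
        int h = highPower(m);
        int current = hash(txt, 0, m);
        for (int s = 0; s + m < txt.length(); s++) {
            current = roll(txt, s, m, current, h);
            System.out.println("substring at " + (s + 1) + ": rolled=" + current
                    + ", direct=" + hash(txt, s + 1, m));
        }
    }
}
```

The rolled value needs a constant number of operations per substring, whereas hashing each substring from scratch needs m operations.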


For example: 


if the text is :- 


the pattern is :- 


Assume the hash value of the pattern is n. In each comparison, the hash value of a
substring is compared with the hash value of the pattern until a match is found. The
substring is denoted with the colour magenta.

n ≠ hash value of substring

During the first comparison, the hash value of the substring does not match n, so it is
a mismatch. Since a mismatch takes place, the next substring is taken from the text for
comparison. The hash value of the next substring = n, so that particular substring is
taken for individual character comparison. Assume that in the substring the character
value (hash value) of B = 13 and of C = 14, that the value of the letter B in the
pattern = m, and that the value of the letter C in the pattern = k.

First comparison:

m = 13

Since m = 13, the first comparison shows that the first character of the substring is
the same as the first character of the pattern.

Second comparison:

k = 14

During the second comparison, k matches the value of C (k = 14), so the match is said
to be found.


Boyer Moore algorithm : 


The Boyer Moore algorithm was discovered by Robert S. Boyer and J Strother Moore in the
year 1977⁵, and it is said to perform faster as the pattern length increases. This
algorithm uses the classical approach to find the required pattern. There are two rules
in the Boyer Moore algorithm for deciding how far to shift the pattern: the good suffix
rule and the bad character heuristic. In this investigation, the bad character
heuristic will be used.

If a character of the text mismatches with a character of the pattern, that text
character is called the bad character. The algorithm compares the pattern with the text
starting from the rightmost character of the pattern, and whenever a mismatch (bad
character) is spotted, the algorithm shifts the pattern until either the bad character
in the text lines up with its last occurrence in the pattern or, if the bad character
does not occur in the pattern, until the pattern has passed over the bad character in
the text.

For example: 

A text of size 14 and a pattern of length 6 were taken. Whenever a mismatch occurs, it
is marked in red. All the matches are marked in green.

The pattern is compared with the text starting from the rightmost character of the
pattern. First, there is a mismatch at position 3 and the bad character is A. Now the
last occurrence of the bad character (A) is searched for in the pattern, and it is
found at position 1. The pattern is therefore shifted by two positions so that the
mismatch becomes a match.

[Figure: the text at index positions 0 to 13 with the shifted pattern aligned beneath it]

Now the pattern is again compared starting from its rightmost character. There is a
mismatch at position 7. The mismatched character "F" does not occur in the pattern
before position seven, so the pattern is shifted past position seven. After the pattern
is shifted, comparison again takes place from the rightmost character of the pattern.
All the characters of the pattern match the characters of the text, so the pattern is
said to be found and the index value of the position found is returned as the output.

5 "DAA Boyer-Moore Algorithm - javatpoint." https://www.javatpoint.com/daa-boyer-moore-algorithm. Accessed 29 May. 2019.
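The shift used by the bad character heuristic can be sketched in Java as follows. This is a minimal illustration under the assumption that all characters have codes below 256; the class name BadCharacterRule, its method names, and the pattern "ABCAB" are hypothetical and are not taken from the example above.

```java
// Sketch of the bad character heuristic (hypothetical names; assumes character codes below 256).
class BadCharacterRule {
    // last[c] is the index of the last occurrence of character c in pat, or -1 if absent.
    static int[] lastOccurrence(String pat) {
        int[] last = new int[256];
        java.util.Arrays.fill(last, -1);
        for (int i = 0; i < pat.length(); i++) {
            last[pat.charAt(i)] = i;
        }
        return last;
    }

    // How far to shift the pattern after a mismatch at pattern index j against text character c.
    static int shift(int[] last, int j, char c) {
        return Math.max(1, j - last[c]);
    }

    public static void main(String[] args) {
        int[] last = lastOccurrence("ABCAB");
        System.out.println(shift(last, 4, 'C')); // prints 2: C last occurs at index 2
        System.out.println(shift(last, 4, 'F')); // prints 5: F is not in the pattern
        System.out.println(shift(last, 2, 'B')); // prints 1: last B lies to the right of j
    }
}
```

When the bad character does not occur in the pattern at all, the whole pattern jumps past it, which is why longer patterns can skip more of the text.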


Time complexity of the algorithms : 


In this investigation, time complexity is considered as the measure of efficiency. To
find the more efficient algorithm between the Rabin Karp algorithm and the Boyer Moore
algorithm, we will compare the running times of the algorithms when different
variations of patterns are entered. Time complexity describes how the time an algorithm
takes to run grows with the size of its input. The run time, also called execution
time, will be measured for both algorithms when different patterns are entered, and it
will be measured in nanoseconds. The best case of an algorithm occurs when a minimal
amount of processing is needed because the input is favourable to the algorithm. The
worst case of an algorithm occurs when the input is unfavourable to the algorithm and
the maximum amount of processing is required.

In the following expressions, m represents the length of the pattern and n represents
the length of the text.

Time complexity of Boyer Moore algorithm⁶ -

Best case: O(n/m) Worst case: O(mn)

Time complexity of Rabin Karp algorithm -

Best case: O(m+n) Worst case: O((n-m)m)

The worst case in the Rabin Karp algorithm occurs when all the individual characters of
the entered pattern and all the individual characters of the text are the same, as the
hash value of the pattern will then be the same as the hash value of every substring,
and every substring has to be compared character by character.
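The worst case described above can be made visible by counting the character comparisons that Rabin Karp performs. The sketch below is an illustration only, assuming d = 256 and q = 101 as in the appendix code; the class name RabinKarpCost and its methods are hypothetical. It counts only the comparisons made after a hash match.

```java
// Sketch that counts Rabin Karp's character comparisons (hypothetical names; d = 256 assumed).
class RabinKarpCost {
    // Returns how many character comparisons are made after hash matches.
    static long charComparisons(String txt, String pat, int q) {
        int d = 256;
        int m = pat.length();
        int n = txt.length();
        if (m == 0 || m > n) {
            return 0;
        }
        int h = 1;
        for (int i = 0; i < m - 1; i++) {
            h = (h * d) % q;
        }
        int p = 0; // hash of the pattern
        int t = 0; // hash of the current substring
        for (int i = 0; i < m; i++) {
            p = (d * p + pat.charAt(i)) % q;
            t = (d * t + txt.charAt(i)) % q;
        }
        long comparisons = 0;
        for (int s = 0; s <= n - m; s++) {
            if (p == t) {
                for (int j = 0; j < m; j++) {
                    comparisons++;
                    if (txt.charAt(s + j) != pat.charAt(j)) {
                        break;
                    }
                }
            }
            if (s < n - m) {
                t = (d * (t - txt.charAt(s) * h) + txt.charAt(s + m)) % q;
                if (t < 0) {
                    t += q;
                }
            }
        }
        return comparisons;
    }

    // Builds a string made of count copies of c.
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = repeat('a', 1000);
        System.out.println(charComparisons(text, "aaaaa", 101)); // prints 4980
        System.out.println(charComparisons(text, "bbbbb", 101)); // prints 0
    }
}
```

With a text of 1000 identical characters and a matching pattern of length 5, all 996 substrings pass the hash test and each needs 5 comparisons, giving (n-m+1)*m = 4980. With the pattern "bbbbb" no hash ever matches, so no characters are compared.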


Hypothesis : 


As the number of times the pattern occurs in the text increases, the number of times
the algorithm needs to iterate will increase, so the runtime might increase with the
number of occurrences of the pattern for both algorithms. As the position at which the
pattern is found moves further from the start of the text, the runtime might increase,
because the algorithm might have to make more comparisons to reach the pattern; the
runtime might decrease if the pattern is placed a short distance from the first
character.


Whenever a mismatch occurs, Boyer Moore has the benefit of skipping many characters of
the text. As the input pattern length increases, the distance the pattern can be
shifted after a mismatch also increases, so more characters can be skipped. This means
that fewer characters are left to be compared than with a shorter pattern. In this
case, the time taken to process the data will be reduced.

The Rabin Karp algorithm, in contrast, does not have the ability to skip characters;
instead it scans every character of the text. In this case, the time taken to
pre-process the pattern may also be comparatively longer, which would extend the
runtime of the Rabin Karp algorithm. For these reasons, for the second component of the
experiment I hypothesize that the Boyer Moore algorithm might outrun the Rabin Karp
algorithm, and that as the pattern length increases, the run time might increase for
the Rabin Karp algorithm while Boyer Moore will take less time to process the same
pattern.


Investigation : 


For this investigation, the experimentation is divided into two components. As the base
code for both algorithms is readily available online (mentioned in the appendix), the
base code was taken for both algorithms and modified to calculate the runtime. The
dataset used for both experiments was a passage taken from a computer science resource
website; it had 8,247 characters including spaces and 1,344 words. The text was put
into an array, so the position of the pattern is given as an index number in the
explanations that follow. To make the measured runtime more reliable, three trials were
taken for both algorithms throughout the experimentation and averaged.

The runtime was calculated by the following steps:

The start time is recorded once the entered pattern has been constructed into an array
and before the search for the pattern begins. The end time is recorded after the search
process is done and the occurrences of the pattern have been found in the text. The
total time is then obtained by subtracting the start time from the end time.
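The timing steps above, together with the averaging over three trials, can be sketched as follows. This is a minimal illustration, not the code used in the investigation: the class name RuntimeTrials and its methods are hypothetical, and String.indexOf is used only as a stand-in for the search methods of the two algorithms.

```java
// Sketch of timing a search and averaging trials (hypothetical names; indexOf is a stand-in).
class RuntimeTrials {
    // Measures one run of the task in nanoseconds.
    static long timeOnce(Runnable task) {
        long startTime = System.nanoTime();
        task.run();
        long endTime = System.nanoTime();
        return endTime - startTime; // total time = end time - start time
    }

    // Averages the measured time over the given number of trials.
    static double averageNanos(Runnable task, int trials) {
        long total = 0;
        for (int i = 0; i < trials; i++) {
            total += timeOnce(task);
        }
        return (double) total / trials;
    }

    public static void main(String[] args) {
        String txt = "The world wide web started around 1990/91 as a system of servers";
        String pat = "web";
        // Stand-in search; in the investigation this would be the algorithm's search method.
        Runnable search = () -> txt.indexOf(pat);
        System.out.println(averageNanos(search, 3));
    }
}
```

Measured times from System.nanoTime vary between runs, partly because the Java virtual machine compiles code while the program is running, so averaging several trials reduces but does not remove this variation.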
public static void main(String[] args)
{
    String txt = "The world wide web started around 1990/91 as a system of servers connect..."; // text continues
    String pat;

    pat = "to";
    long startTime = System.nanoTime();
    int q = 101; // A prime number

    search(pat, txt, q);
    long endTime = System.nanoTime();

    long totalTime = endTime - startTime;
    System.out.println(totalTime);
}

Figure 1.1: declaration of start time and end time in Rabin Karp


public static void main(String[] args) {
    char txt[] = "The world wide web started around 1990/91 as a system of servers conn...".toCharArray(); // text continues
    char pat[] = "and".toCharArray();
    long startTime = System.nanoTime();
    search(txt, pat);
    long endTime = System.nanoTime();
    long totalTime = endTime - startTime;
    System.out.println(totalTime);
}

Figure 1.2: declaration of start time and end time in Boyer Moore algorithm
Variables used in the experiment :


Dependent variable: The runtime is the dependent variable used throughout the
experiment for both components. It is calculated as the difference between the start
time and the end time of the program, and it is measured in nanoseconds.


Independent variable:

The independent variable is the variable which is changed in the experimentation. Since
this investigation focuses on the pattern, three factors of the pattern are changed
throughout the experimentation: the number of times the pattern occurs in the text, the
position of the pattern in the text, and the length of the pattern.

In the first component of the experimentation, the number of times a pattern occurs and
the position of the pattern are the independent variables. These variables were chosen
because I could not find enough resources about whether multiple occurrences of a
pattern in a text, or the position of the pattern, affect the runtime of the
algorithms. In the second component of the experiment, the pattern length is the
independent variable.
Controlled variables:


Since the running environment affects the run time, all the trials must be conducted on
the same computer and multitasking must be avoided.

Computer and the operating system: I used an Acer Aspire ES 15 laptop, and the Windows
10 OS was used. Processor: Intel(R) Pentium(R) CPU N3710 @ 1.60GHz. RAM memory: 2.00 GB,
where 1.83 GB was usable. System type: 64-bit operating system.

The number of times the pattern occurred in the text: This was only used as a
controlled variable in the second component of the experimentation. Since the number of
times the pattern occurred in the text affected the runtime taken, all the patterns
chosen for the second component were chosen in such a way that they occurred only once
throughout the text.

Same algorithm: The same algorithm (mentioned in the appendix) was used for both
components of the experimentation.

Same dataset: The same data set (text) was used for both components of the experiment.

Integrated development environment used: Throughout the experimentation, all the
programs were run on the same IDE. Java: 1.8.0_171; Java HotSpot(TM) Client VM
25.171-b11. Runtime: Java(TM) SE Runtime Environment 1.8.0_171-b11. System: Windows 10
version 10.0 running on x86; Cp1252; en_IN (nb).
Experiment part 1:


The first component of the experiment focuses on how multiple occurrences of the
pattern and the position of the pattern in the text might affect the runtime. This
experiment was conducted using single words only. Words with pattern lengths from 2 to
14, placed at different positions in the text, were taken. A few pairs of single words
were taken which had the same pattern length but occurred at different positions, to
check whether a change in the position of the string affects the running time of the
algorithms.


Test results of the experiment part 1 : 


Pattern        | Pattern length | Number of times the pattern was repeated in the text | Rabin Karp average time/nanoseconds | Boyer Moore average time/nanoseconds
the            | 3  | 103 | 7714357.33 | 45246319.7
as             | 2  | 45  | 5271476.33 | 4407513
and            | 3  | 44  | 4859098.67 | 3465778
be             | 2  | 33  | 4160444.67 | 3092209
web            | 3  | 29  | 4119374.67 | 2739551
that           | 4  | 25  | 4052700.7  | 2430848.33
use            | 3  | 23  | 3916265    | 2525992.67
are            | 3  | 18  | 3702704    | 2106357.67
information    | 11 | 11  | 3596840.67 | 1533135
from           | 4  | 10  | 3356547.33 | 1824150
database       | 8  | 9   | 3257663    | 1456752
allows         | 6  | 8   | 3061290.33 | 1439322.33
which          | 5  | 7   | 3030063    | 1589826.33
standards      | 9  | 6   | 3037917.67 | 511249
application    | 11 | 5   | 3085006.33 | 1236289.33
might          | 5  | 4   | 2946821.67 | 1368333.33
smarter        | 7  | 3   | 2857020.67 | 1243982.67
international  | 13 | 2   | 2743683    | 765270
authentication | 14 | 1   | 2654117.67 | 1011932.67

Table 1: processed data of experimentation part 1, describing the relationship between
the number of occurrences of the pattern in the text and the average runtime of both
algorithms.

From the first part of the experiment, it was evident that the position of the pattern
in the text does not affect the time taken by the algorithm to find the pattern, but
the number of times the pattern repeats in the text did affect the time taken. As the
number of times the pattern occurred in the text increased, the run time also
increased. Patterns which were placed at two different positions in the text but had
the same length had similar runtimes. In the table below, the same font colour is used
to denote the sets of words which had the same number of occurrences and the same
pattern length. Only words with the same number of occurrences and the same pattern
length are extracted from the table, because this makes it easy to show and compare the
relationship.


Number of occurrences of the pattern | Pattern | Index number | Rabin Karp | Boyer Moore | Pattern length
[table rows not recoverable]

Table 2: processed data of experimentation part 1, describing the relationship between
the position of the pattern in the text and the average runtime of both algorithms.


For example, take the pair of words manipulation, which has index number 4222, and
navigational, which has index number 5561, and the pair credentials, which has index
number 4395, and reliability, which has index number 3295. The words in each pair have
the same pattern length but are placed at different positions in the text, yet both
words in each pair took a similar amount of runtime.

The index number is the position of the pattern in the text, and it starts from 0. The
words in table 1 were chosen based on their pattern length and the number of times they
occur, because these two factors affect the running time of the algorithms.


[Graph 1: RabinKarp, average runtime/nanoseconds (vertical axis, 0 to 9000000) against times of occurrence (horizontal axis)]

Graph 1 - the graph showing the relationship between times of occurrence and average
runtime for Rabin Karp algorithm.
[Graph 2: BoyerMoore, average runtime/nanoseconds (vertical axis, 0 to 50000000) against times of occurrence (horizontal axis)]

Graph 2 - the graph showing the relationship between times of occurrence and average
runtime for BoyerMoore algorithm.

Graph 1 and graph 2 clearly show that as the number of occurrences of the pattern
increased, the run time also increased for both algorithms. In graph 2, there is a
drastic increase from 45 to 103 occurrences because the interval between 45 and 103 is
large. When Boyer Moore is used, even small intervals produce a more noticeable change
in runtime as the number of occurrences increases, compared to the Rabin Karp
algorithm. In graph 1, the change from 45 to 103 is comparatively small: there is a
change in runtime, but it is not as large as the change for Boyer Moore.


Experiment part 2 : 


The second experiment was done with collections of words. After considering the results
of the previous experiment, a few changes were made to the variables. The position of
the pattern was ignored since it did not cause major changes in the runtime. Sets of
words were taken for this experiment with pattern lengths increasing from 30 to 102.
Test results of experiment part 2 :


Pattern length | Rabin Karp  | Boyer Moore
30             | 2658804.667 | 936716.6667
32             | 2665272.333 | 940072
34             | 2660616.333 | 955339.3333
36             | 2658961.333 | 942644.3333
38             | 2662373.333 | 931062.6667
40             | 2654472     | 936641.3333
42             | 2664516.333 | 927201.3333
44             | 2657593.667 | 924798.3333
46             | 2668402.333 | 923526
48             | 2654195.667 | 916860.6667
50             | 2667556     | 916292.6667
52             | 2661121     | 929743.6667
54             | 2663283.333 | 919067.6667
56             | 2664845     | 920325.6667
58             | 2656349.667 | 901712.6667
60             | 2669121     | 902219
62             | 2652258     | 902640.3333
64             | 2663879.333 | 907973.6667
68             | 2661470     | 902340
70             | 2657725     | 906037.6667
72             | 2662532     | 902774
74             | 2661623.333 | 904874
78             | 2651833     | 902163.6667
80             | 2667393     | 902481.3333
82             | 2653533     | 900329
84             | 2659185     | 898027
86             | 2660549.667 | 898555.3333
88             | 2666174.667 | 896339
90             | 2659447.667 | 898105.6667
92             | 2664940     | 896398.6667
94             | 2677279.333 | 894998.3333
96             | 2668155.333 | 893554.6667
98             | 2661439     | 893745.6667
100            | 2662282.333 | 892130.3333
102            | 2676289     | 892408


Table 3: processed data of experimentation part 2, describing the relationship between
the average runtime of both algorithms (in nanoseconds) and the pattern length.
Graph 3 - Average runtime by pattern length graph for RabinKarp algorithm

Graph 4 - Average runtime by pattern length graph for Boyer Moore algorithm

From the results of the second part of the experiment, it can be seen that as the
pattern length increased, there was not much change in the runtime of the RabinKarp
algorithm. The minor fluctuations can be caused by the processor, and they always exist
even though the running environment was kept constant for every trial. The fluctuations
look like a change in trend in graph 3, but in reality they are very small and do not
change the trend. The average running time stayed between about 2,650,000 and 2,680,000
nanoseconds.

The change in pattern length did affect the time taken when the Boyer Moore algorithm
was used. The time taken for execution of the program decreased as the pattern length
increased, as shown in graph 4.
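One possible explanation for this decrease, consistent with the bad character heuristic, is that a longer pattern allows longer shifts, so fewer alignments of the pattern are examined. The sketch below illustrates this on an artificial text; it is an assumption-laden illustration and not a measurement from the investigation. The class name BoyerMooreAlignments and its methods are hypothetical, and character codes below 256 are assumed.

```java
// Sketch counting the alignments Boyer Moore examines (hypothetical names; codes below 256 assumed).
class BoyerMooreAlignments {
    // Returns how many positions of the pattern are tried with the bad character heuristic.
    static int countAlignments(String txt, String pat) {
        int n = txt.length();
        int m = pat.length();
        int[] last = new int[256];
        java.util.Arrays.fill(last, -1);
        for (int i = 0; i < m; i++) {
            last[pat.charAt(i)] = i;
        }
        int alignments = 0;
        int s = 0;
        while (s <= n - m) {
            alignments++;
            int j = m - 1;
            // Compare from the rightmost character of the pattern.
            while (j >= 0 && pat.charAt(j) == txt.charAt(s + j)) {
                j--;
            }
            if (j < 0) {
                s += (s + m < n) ? m - last[txt.charAt(s + m)] : 1;
            } else {
                s += Math.max(1, j - last[txt.charAt(s + j)]);
            }
        }
        return alignments;
    }

    // Builds a string made of count copies of c.
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = repeat('x', 1000);
        System.out.println(countAlignments(text, "abcd"));       // prints 250
        System.out.println(countAlignments(text, "abcdefghij")); // prints 100
    }
}
```

In this artificial case no character of the text occurs in the pattern, so every mismatch shifts the pattern by its full length and the number of alignments is about n/m. Real text gives smaller shifts, but the same tendency would explain the trend in graph 4.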


Conclusion : 


Returning to the research question “to what extent the variation in the search pattern
may affect the efficiency of Rabin Karp algorithm and Boyer Moore algorithm in terms of
time complexity”, it was evident from the overall results that the Boyer Moore
algorithm outperformed the Rabin Karp algorithm in almost all situations. The only
exception in table 1 was the pattern "the", which occurred 103 times and took longer
with Boyer Moore. Otherwise the Boyer Moore algorithm ran much faster than the Rabin
Karp algorithm throughout the experimentation. Half of my hypothesis about pattern
length was correct: as the pattern length increased, the runtime decreased for the
Boyer Moore algorithm. For Rabin Karp, what I hypothesized was wrong, as there was no
change in the trend when the pattern length increased. I was also wrong about the run
time increasing with the position of the pattern in the text, as the position of the
pattern did not affect the run time of either algorithm. As my results showed, I was
right in my hypothesis that the runtime increases with the number of occurrences of the
pattern. Rabin Karp is still used in various plagiarism checking programs because it is
said to be more suitable for handling multiple patterns, and because it uses a hashing
approach which is not used by the other major algorithms. The Boyer Moore algorithm can
be used for handling long patterns, since it takes less runtime to find the pattern as
the pattern length increases.


Further scope of the investigation : 


As only two string searching algorithms with different approaches (the classical
approach and the hashing approach) were taken in this investigation, for further
investigation I want to take string searching algorithms from the other two approaches
(the suffix automata approach and the bit parallelism approach) and compare them to
find the most efficient string searching algorithm, the one with the lowest average
runtime. I also want to check whether the trend changes for different data types, such
as binary alphabets and DNA alphabets, and find the most suitable string algorithm for
each data type. Since only a small data set was used for the text this time, I want to
change the data set size and see how it affects the runtime of the different string
matching algorithms.


Limitations : 


The investigation was carefully planned so that a minimal amount of error would be
produced, so there were not many limitations as far as I know. As different people use
different processors and different hardware, the runtime might differ between
computers, since the processor speed differs, but I believe this would not affect the
trend of the relationship found between the variables.
Appendix :


The code for both algorithms was taken from a website called GeeksforGeeks, and the
idea and code for calculating the run time⁷ were taken from a web forum called
stackoverflow.com, where many people across the world share their ideas about
programming queries.
Coding for Rabin Karp algorithm in java :⁸


package rabin.karp;

public class rabinkarp {
    // d is the number of characters in the input alphabet
    public static int d = 256;

    static void search(String pat, String txt, int q)
    {
        int M = pat.length();
        int N = txt.length();
        int i, j;
        int p = 0; // hash value for the pattern
        int t = 0; // hash value for the current substring of the text
        int h = 1;
        // h = d^(M-1) % q
        for (i = 0; i < M-1; i++)
            h = (h*d)%q;
        // hash of the pattern and of the first substring
        for (i = 0; i < M; i++)
        {
            p = (d*p + pat.charAt(i))%q;
            t = (d*t + txt.charAt(i))%q;
        }
        for (i = 0; i <= N - M; i++)
        {
            // compare characters only when the hash values match
            if (p == t)
            {
                for (j = 0; j < M; j++)
                {
                    if (txt.charAt(i+j) != pat.charAt(j))
                        break;
                }
                if (j == M)
                    System.out.println("Pattern found at index " + i);
            }
            // hash of the next substring
            if (i < N-M)
            {
                t = (d*(t - txt.charAt(i)*h) + txt.charAt(i+M))%q;
                if (t < 0)
                    t = (t + q);
            }
        }
    }

    public static void main(String[] args)
    {
        String txt = "The world wide web started around 1990/91 as a system of servers connected over the interne..."; // text continues
        String pat;
        pat = "to";
        long startTime = System.nanoTime();
        int q = 101; // A prime number
        search(pat, txt, q);
        long endTime = System.nanoTime();
        long totalTime = endTime - startTime;
        System.out.println(totalTime);
    }
}


7 "How to calculate the running time of my program? - Stack Overflow." 6 Mar. 2011, https://stackoverflow.com/questions/5204051/how-to-calculate-the-running-time-of-my-program. Accessed 9 May. 2019.

8 "Rabin-Karp Algorithm for Pattern Searching - GeeksforGeeks." https://www.geeksforgeeks.org/rabin-karp-algorithm-for-pattern-searching/. Accessed 12 May. 2019.
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Coding for Boyer Moore algorithm in java: 


package boyer.moore;

/**
 *
 * @author Student
 */
public class BoyerMoore {

    /**
     * @param args the line
     */

    static int NO_OF_CHARS = 8243;
    static int max(int a, int b) { return (a > b) ? a : b; }
    public long startTime = System.nanoTime();

    // Fills badchar[] with the index of the last occurrence of each character in the pattern
    static void badCharHeuristic(char[] str, int size, int badchar[])
    {
        int i;
        for (i = 0; i < NO_OF_CHARS; i++)
            badchar[i] = -1;
        for (i = 0; i < size; i++)
            badchar[(int) str[i]] = i;
    }

    static void search(char txt[], char pat[])
    {
        int m = pat.length;
        int n = txt.length;

        int badchar[] = new int[NO_OF_CHARS];
        badCharHeuristic(pat, m, badchar);

        int s = 0; // s is the shift of the pattern with respect to the text

        while (s <= (n - m))
        {
            int j = m-1;
            // Compare from the right end of the pattern towards the left
            while (j >= 0 && pat[j] == txt[s+j])
                j--;
            if (j < 0)
            {
                System.out.println("Patterns occur at shift = " + s);
                s += (s+m < n) ? m-badchar[txt[s+m]] : 1;
            }
            else
                s += max(1, j - badchar[txt[s+j]]);
        }
    }

    public static void main(String[] args)
    {
        // The string literal is cut off at the edge of the listing
        char txt[] = "The world wide web started around 1990/91 as a system of servers connected over the intern...".toCharArray();
        char pat[] = "and".toCharArray();
        long startTime = System.nanoTime();
        search(txt, pat);
        long endTime = System.nanoTime();
        long totalTime = endTime - startTime;
        System.out.println(totalTime);
    }
}


"Boyer Moore Algorithm for Pattern Searching - GeeksforGeeks." https://www.geeksforgeeks.org/boyer-moore-algorithm-for-pattern-searching/. Accessed 29 May. 2019.
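
The bad character rule used in the Boyer Moore listing can also be sketched without a fixed-size table. The version below is an illustration only and not the code used in the experiment: it swaps the badchar array for a HashMap, so no NO_OF_CHARS constant is needed, and it returns the shifts as a list instead of printing them. The class name BadCharacterSketch and the method names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BadCharacterSketch {
    // Maps each character of the pattern to the index of its last occurrence.
    static Map<Character, Integer> lastOccurrence(String pat) {
        Map<Character, Integer> last = new HashMap<>();
        for (int i = 0; i < pat.length(); i++) {
            last.put(pat.charAt(i), i);
        }
        return last;
    }

    // Returns every shift at which pat occurs in txt.
    static List<Integer> search(String txt, String pat) {
        List<Integer> shifts = new ArrayList<>();
        int m = pat.length();
        int n = txt.length();
        if (m == 0 || m > n) {
            return shifts;
        }
        Map<Character, Integer> last = lastOccurrence(pat);

        int s = 0; // current shift of the pattern against the text
        while (s <= n - m) {
            int j = m - 1;
            // Compare from the right end of the pattern towards the left
            while (j >= 0 && pat.charAt(j) == txt.charAt(s + j)) {
                j--;
            }
            if (j < 0) {
                shifts.add(s);
                // Align the character just after the window with its last
                // occurrence in the pattern (-1 if it does not occur at all)
                s += (s + m < n) ? m - last.getOrDefault(txt.charAt(s + m), -1) : 1;
            } else {
                // Move the mismatched text character under its last occurrence
                // in the pattern, always advancing by at least one position
                s += Math.max(1, j - last.getOrDefault(txt.charAt(s + j), -1));
            }
        }
        return shifts;
    }

    public static void main(String[] args) {
        System.out.println(search("system of servers and standards", "and")); // prints [18, 24]
    }
}
```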
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Data set used: *° 

The data set used was taken from the cited website: "Option C - Web Science - cs-ib." https://www.cs-ib.net/topic/C-web-science.html. Accessed 12 May. 2019.

The world wide web started around 1990/91 as a system of servers connected over the
internet that deliver static documents, which are formatted as hypertext mark-up
language (HTML) files, which support links to other documents, but also multimedia as
graphics, video or audio. In the beginnings of the web, these documents consisted
mainly of static information and text, where multimedia was added later. Some experts
describe this as a read-only web, because users mostly searched and read information,
while there was little user interaction or content contribution. However, the web started
to evolve into the delivery of more dynamic documents, enabling user interaction or
even allowing content contribution. The appearance of blogging platforms as Blogger in
1999 gives a time mark for the birth of the Web 2.0. Continuing the model from before,
this would be the evolution to a read-write web. This opened new possibilities and lead
to new concept as blogs, social networks or video-streaming platforms. Web 2.0 might
also be looked at from the perspective of the websites themselves evolving in more
dynamic and feature-rich. For instance, improved design, JavaScript and dynamic
content loading could be considered Web 2.0 features. The internet and thus the World
Wide Web is constantly developing and evolving into new directions and while the
changes described for the Web 2.0 are clear to us today, the definition for the Web 3.0
is not definitive yet. Continuing the read to read-write description form earlier, it might be
argued that the Web 3.0 would be the read-write-execute web. One interpretation of this
is that the web enables software agents to work with documents by using semantic
mark-up. This allows for smarter searches and the presentation of relevant data fitting 
into context. This is why Web 3.0 is sometimes called the semantic executive web. It is 
about user input becoming more meaningful, more semantic, by users giving tags or 
other kinds of data to their document, that allow software agents to work with the input, 
e.g. to make it more searchable. The idea is to be able to better connect information 
that is semantically connected. However, it might also be argued that the Web 3.0 is 
what some people call the Internet of Things, which is basically connecting every day 
devices to the internet to make them smarter. In some way, this also fits the read-write- 
execute model, as it allows the user to control a real life action on a device over the 
internet. Either way, the web keeps evolving and the following image provides a good 
overview and an idea where the web is heading to. However, it might also be argued 
that the Web 3.0 is what some people call the Internet of Things, which is basically 
connecting every day devices to the internet to make them smarter. It has been founded 
in 1946 and since then has published over 21000 international standards regarding 
aspects of technology and manufacturing. The members are from 163 countries 
including 3 368 technical bodies that help standards to be developed. In addition, the 
organization has over 135 people working fulltime at the central in Geneva. Experts of 
the same field work together to develop standards and these are settled on through a 
consensus process. These standards ensure safety, reliability and quality for products 
and services, while also providing a common denominator for different processes to 
communicate, e.g. for technologies. Sites that include server-side programming as well,
usually to retrieve content dynamically from a database. This allows for data processing
on the server and allows for much more complex applications. In some way, this also
fits the read-write-execute model, as it allows the user to control a real life action on a 
device over the internet. Either way, the web keeps evolving and the following image 
provides a good overview and an idea where the web is heading to. ISO is the 
International Organization of Standardization, an independent, non-governmental 
organization that develops and publishes international standards. Website logic that 
runs on the server. Common tasks include the processing of search queries, data 
retrieval from a database and various data manipulation tasks. Good examples are 
online-shops, where items are displayed based on a search query. Once the user 
decides to buy an item, server-side scripts check user credentials and make sure that 
the shop receives the order. Cookies are small files stored on a user computer. They 
hold data specific to a website or client and can be accessed by either the web server or 
the client computer. Cookies contain data values such as first-name and last-name. 
Once the server or client computers have read the cookie through their respective 
codes, the data in the cookie can be retrieved and used for a website page. Cookies are 
created usually when a new web page is loaded. Disabling cookies on your computer 
will abort the writing operation that creates cookies. However, some sites require 
cookies in order to function. Cookies are used to transport information from one session 
on a website to another. They eliminate the use of server machines with huge amounts 
of data storage, since cookies are more efficient and smaller. A database is an 
organized collection of data, which allows retrieving specific data easily based on 
queries. Data are usually organized in a way that allows the application to find data
easily. There are different logic models of how to organize data in a database, e.g.
relational models, object models, navigational models and more. A database is access
(in order to retrieve data, update them, administration) through a database management 
system (DBMS), such as for example MySQL, PostgreSQL, MongoDB, etc. .. . These 
systems usually differ in the database model that they use. XML is a flexible way to 
structure data and can therefore be used to store data in files or to transport data. It 
allows data to be easily manipulates, exported, or imported. This way, websites can 
also be designed independent from the data content. Example uses of XML are RSS 
feeds, where it is used to store data about a feed. This is a standard protocol for web 
servers to execute console programs (applications that run from the command line) in 
order to generate dynamic websites. It implements an interface for the web server (as in 
the software) to pass on user information, e.g. a query, to the application, which can 
then process it. This passing of information between the web server and the console 
application is called the CGI. Thanks to CGI, a variety of programming languages such 
as Perl, Java, C or C++ can be used, which allow for fast server-side scripting. The 
surface web is the part of the web that can be reached by a search engine. For this, 
pages need to be static and fixed, so that they can be reached through links from other 
sites on the surface web. They also need to be accessible without special configuration. 
Examples include Google, Face book, YouTube, etc. The deep web is the part of the 
web that is not searchable by normal search engines. Reasons for this include 
proprietary content that requires authentication or VPN access, e.g. private social 
media, emails; commercial content that is protected by pay walls, e.g. online news 
papers, academic research databases; personal information that is protected, e.g. bank
information, health records; dynamic content. Dynamic content is usually a result of
some query, where data are fetched from a database Interoperability can be defined as
the ability of two or more systems or components to exchange information and to use 
the information that has been exchanged. In order for systems to be able to 
communicate they need to agree on how to proceed and for this reason standards are 
necessary. Lossy compression or irreversible compression is the class of data encoding 
methods that uses inexact approximations and partial data discarding to represent the 
content. These techniques are used to reduce data size for storage, handling, and 
transmitting content. Lossless data compression algorithms usually exploit statistical 
redundancy to represent data without losing any information, so that the process is
reversible.


Raw data collected during the experimentation : 


In the following tables, the first row for each pattern gives the runtime of the Rabin Karp
algorithm and the second row gives the runtime of the Boyer Moore algorithm.
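
As an illustration of how the three trials and their average in each row can be obtained, here is a minimal sketch based on System.nanoTime(), the same call used in both listings. The class name TimingSketch and the helper names timeOnce and average are assumptions, and String.indexOf only stands in for a call to one of the two search methods.

```java
class TimingSketch {
    // Runs the task once and returns the elapsed time in nanoseconds.
    static long timeOnce(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    // Arithmetic mean of the recorded trials.
    static double average(long... trials) {
        long sum = 0;
        for (long t : trials) {
            sum += t;
        }
        return (double) sum / trials.length;
    }

    public static void main(String[] args) {
        String txt = "the deep web is not searchable by normal search engines";
        String pat = "search";

        long[] trials = new long[3];
        for (int i = 0; i < trials.length; i++) {
            // Placeholder for search(pat, txt, q) or search(txt, pat)
            trials[i] = timeOnce(() -> txt.indexOf(pat));
        }
        System.out.printf("Average time: %.3f ns%n", average(trials));
    }
}
```

For example, average(2674637, 2676146, 2588304) gives about 2646362.333, which is the value recorded for the pattern "navigational" under Rabin Karp.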


Experimentation part 1:

Pattern      | Pattern length | Times repeated in the text | Trial 1  | Trial 2  | Trial 3  | Average time / nanoseconds
             |                |                            |  4709006 |  4760364 |  3753169 |
             |                |                            |  4843789 |  4889872 |  4843635 |
             |                |                            |  3421454 |  3498898 |  3476982 |
             |                |                            |  1014280 |  1010439 |  1011079 | 1011932.667
navigational | 12             | 1                          |  2674637 |  2676146 |  2588304 | 2646362.333
             |                |                            |  1019400 |  1011080 |  1016200 | 1015560
database     | 8              | 9                          |  3283051 |  3255530 |  3234408 | 3257663
             |                |                            |  1466145 |  1423245 |  1480866 | 1456752
system       | 6              | 5                          |  3069280 |  2973275 |  2974555 | 3005703.333
             |                |                            |  1366299 |  1367580 |  1377180 | 1370353
standards    | 9              | 6                          |  3024477 |  3060319 |  3028957 | 3037917.667
             |                |                            |   120135 |   122695 |  1290917 | 511249
communicate  | 10             | 2                          |  2785589 |  2634387 |  2732466 | 2717480.667
             |                |                            |  1095031 |  1098806 |  1096991 | 1096942.667
information  | 11             | 11                         |  3554500 |  3664491 |  3571531 | 3596840.667
             |                |                            |  1521828 |  1525668 |  1551909 | 1533135
reversible   | 10             | 2                          |  2695508 |  2732588 |  2730546 | 2719547.333
             |                |                            |  1080843 | 71089163 |  1084043 | 24418016.33
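
The column giving the number of times each pattern is repeated in the text can be checked with a short helper. The sketch below is an assumption about how such a count could be made, not the method used for the table: it is case-sensitive, it counts overlapping occurrences, and the names OccurrenceCountSketch and countOccurrences are hypothetical.

```java
class OccurrenceCountSketch {
    // Counts how many times pat occurs in txt (case-sensitive, overlaps included).
    static int countOccurrences(String txt, String pat) {
        if (pat.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = txt.indexOf(pat);
        while (from >= 0) {
            count++;
            // Continue one position further so overlapping matches are found
            from = txt.indexOf(pat, from + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("a system of systems", "system")); // prints 2
    }
}
```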
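Experimentation part 2:

Pattern taken                   | Length | Trial 1 | Trial 2 | Trial 3 | Average runtime
--------------------------------+--------+---------+---------+---------+----------------
non-governmental                |        | 2653481 | 2683032 | 2639901 | 2658804.667
                                |        |  898587 |  890067 |  905427 | 898027
--------------------------------+--------+---------+---------+---------+----------------
                                |        | 2638721 | 2678289 | 2664639 | 2660549.667
                                |        |  899088 |  898234 |  898344 | 898555.3333
--------------------------------+--------+---------+---------+---------+----------------
                                |        | 2677654 | 2644326 | 2676544 | 2666174.667
                                |        |  897958 |  896592 |  894467 | 896339
--------------------------------+--------+---------+---------+---------+----------------
[...] more complex              | 90     | 2633680 | 2788901 | 2555762 | 2659447.667
applications                    |        |  897889 |  898773 |  897655 | 898105.6667
--------------------------------+--------+---------+---------+---------+----------------
Communicate they need to        | 92     | 2694352 | 2601038 | 2699430 | 2664940
agree on how to proceed and     |        |  896778 |  896890 |  895528 | 896398.6667
for this reason standards are   |        |         |         |         |
necessary                       |        |         |         |         |
--------------------------------+--------+---------+---------+---------+----------------
irreversible compression is     | 94     | 2632114 | 2603821 | 2795903 | 2677279.333
the class of data encoding      |        |  895661 |  894333 |  895001 | 894998.3333
methods that uses inexact       |        |         |         |         |
approximation                   |        |         |         |         |
--------------------------------+--------+---------+---------+---------+----------------
include server-side             | 96     | 2698238 | 2615575 | 2690653 | 2668155.333
programming as well, usually    |        |  893999 |  892330 |  894335 | 893554.6667
to retrieve content dynamically |        |         |         |         |
from a database                 |        |         |         |         |
--------------------------------+--------+---------+---------+---------+----------------
these documents consisted       | 98     | 2643729 | 2689594 | 2650994 | 2661439
mainly of static information    |        |  893465 |  893440 |  894332 | 893745.6667
and text, where multimedia      |        |         |         |         |
were added later                |        |         |         |         |
--------------------------------+--------+---------+---------+---------+----------------
of the same field work together | 100    | 2690032 | 2600075 | 2696740 | 2662282.333
to develop standards and        |        |  892243 |  890224 |  893924 | 892130.3333
these are settled on through a  |        |         |         |         |
consensus pro                   |        |         |         |         |
--------------------------------+--------+---------+---------+---------+----------------
the same field work together to | 102    | 2730075 | 2693361 | 2605431 | 2676289
develop standards and these     |        |  893541 |  892111 |  891572 | 892408
are settled on through a        |        |         |         |         |
consensus process               |        |         |         |         |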
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